Data-Driven Model

Take the weather records and forecasts provided for the Data Expo


  • 111 “expo sites” provided by Data Expo in the continental USA

  • National Weather Service (NWS) forecasts provided

  • Want to forecast weather from September 1st, 2016 to September 1st, 2017

  • Need more information in order to create a forecast

  • Example: Build a one day out forecast for Baltimore, MD

Use the NCDC records as a training set


  • ~30,000 weather stations from the National Climatic Data Center (NCDC)

  • Use all data before September 1st, 2016 as the training set

  • New problem: excess of predictors leads to noise and dimensionality issues

  • Need a method for variable selection
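The date-based split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout (a list of `(date, value)` tuples) and the function name `split_by_date` are assumptions, as is treating the end date of the forecast window as inclusive.

```python
from datetime import date

# Dates taken from the text; the inclusive end date is an assumption.
TRAIN_CUTOFF = date(2016, 9, 1)
FORECAST_END = date(2017, 9, 1)


def split_by_date(records):
    """Split (date, value) records into a training set and a forecast window.

    Everything strictly before the cutoff is training data; records from the
    cutoff through the end date form the evaluation period.
    """
    train = [r for r in records if r[0] < TRAIN_CUTOFF]
    test = [r for r in records if TRAIN_CUTOFF <= r[0] <= FORECAST_END]
    return train, test


records = [
    (date(2016, 8, 31), 84.0),
    (date(2016, 9, 1), 82.0),
    (date(2017, 9, 2), 79.0),  # outside the forecast window, dropped
]
train, test = split_by_date(records)
```

Splitting on the date alone keeps every observation in the forecast window out of the training data.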

Find the 200 strongest cross-PACF values


  • Lag the NCDC data by 1 to 4 days (1 day in this example)

  • Take the 200 stations with the strongest cross-PACF (lag 1 cross-PACF shown)

  • The cross-PACF (cross partial autocorrelation) finds the most relevant stations while accounting for correlations at smaller lags
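The station ranking above might look like the sketch below for the lag 1 case. It rests on a stated assumption: the poster does not define its cross-PACF, so this takes it to be the correlation between today's target value and a station's value `lag` days earlier after removing the effect of shorter lags. Under that assumption there are no shorter lags to remove at lag 1, so the code uses the plain lagged cross-correlation; larger lags would need the partialling step added. The names `lagged_corr` and `strongest_stations` are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / sqrt(va * vb)


def lagged_corr(target, station, lag=1):
    """Correlate target[t] with station[t - lag]."""
    return pearson(target[lag:], station[:-lag])


def strongest_stations(target, stations, n=200, lag=1):
    """Return the n station names with the largest absolute lagged correlation."""
    scores = {name: lagged_corr(target, series, lag) for name, series in stations.items()}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)[:n]


target = [1, 2, 3, 4, 5, 6]
stations = {
    "leads_target": [2, 3, 4, 5, 6, 0],     # yesterday's value equals today's target
    "unrelated": [1, -1, 1, -1, 1, -1],
}
top = strongest_stations(target, stations, n=1)
```

Ranking by absolute value keeps strongly negative relationships as well as positive ones.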

Build a density-based hull around the chosen stations


  • We don't want to use the 200 sites alone because of potential noise/variance issues

  • Build a two-dimensional density hull around the points

  • This allows us to select only the influential regions of the country
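One way to picture the hull step is sketched below. The poster does not say how the density hull is built, so everything here is an assumption: a Gaussian kernel density estimate over the selected stations' (longitude, latitude) pairs treated as planar coordinates, a density level chosen so that roughly 90% of the selected stations fall inside, and membership decided by comparing each candidate station's density to that level. The names `kde`, `hull_level`, and `stations_in_hull` are hypothetical.

```python
from math import exp


def kde(point, centers, bandwidth):
    """Unnormalized Gaussian kernel density at `point` from the given centers."""
    px, py = point
    total = sum(
        exp(-((px - cx) ** 2 + (py - cy) ** 2) / (2 * bandwidth ** 2))
        for cx, cy in centers
    )
    return total / len(centers)


def hull_level(selected, bandwidth, keep=0.9):
    """Density level such that about `keep` of the selected points lie at or above it."""
    densities = sorted(kde(p, selected, bandwidth) for p in selected)
    return densities[int((1 - keep) * len(densities))]


def stations_in_hull(candidates, selected, bandwidth=1.0, keep=0.9):
    """Names of candidate stations whose density reaches the hull level."""
    level = hull_level(selected, bandwidth, keep)
    return [name for name, p in candidates.items() if kde(p, selected, bandwidth) >= level]


# Toy coordinates: five selected stations clustered near the origin.
selected = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1)]
candidates = {"near": (0.05, 0.05), "far": (10.0, 10.0)}
inside = stations_in_hull(candidates, selected)
```

Thresholding a density, rather than taking a convex hull, lets the region split into several disconnected influential areas.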

Use all weather stations within the hull to train a model


  • Use all NCDC stations within the hull as the training set

  • Build ridge regression model

  • We considered LASSO, but variable selection has already been performed

  • Compare the Data Driven forecast to the NWS forecast, September 2016 to September 2017
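For readers unfamiliar with ridge regression, the sketch below fits one from its closed form, solving (XᵀX + λI)β = Xᵀy on centered data so that the intercept is not penalized. It is an illustration only: the poster does not give the penalty value, the predictor layout, or the software used, so `lam`, the toy data, and the helper names `solve`, `fit_ridge`, and `predict` are all assumptions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ridge(X, y, lam=1.0):
    """Return (intercept, coefficients) of a ridge fit with penalty lam."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Gram matrix of the centered predictors plus lam on the diagonal.
    gram = [
        [sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0) for j in range(p)]
        for i in range(p)
    ]
    xty = [sum(r[i] * v for r, v in zip(Xc, yc)) for i in range(p)]
    coef = solve(gram, xty)
    intercept = y_mean - sum(c * m for c, m in zip(coef, x_mean))
    return intercept, coef


def predict(model, row):
    """Predict the response for one row of predictor values."""
    intercept, coef = model
    return intercept + sum(c * v for c, v in zip(coef, row))


# Toy data generated from y = 1 + 2*x1 + 3*x2.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]
unpenalized = fit_ridge(X, y, lam=0.0)
shrunk = fit_ridge(X, y, lam=1.0)
```

Unlike LASSO, the ridge penalty shrinks coefficients toward zero without setting any exactly to zero, which fits a workflow where the predictors have already been chosen.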

Lag 2 Hull Example


  • Selected sites

  • Use all NCDC stations within the hull as the training set

  • Build ridge regression model

Maximum Temperature Forecast

Summary of Absolute Forecast Error

Measure                                         Avg      Std       Min       Q1      Med       Q3      Max
NWS Forecast Abs. Error                      3.0658   2.6252    0.0000   1.0000   3.0000   4.0000    17.00
Data Driven Abs. Error                       4.5643   3.5070    0.0115   1.9134   4.0288   6.2068    20.67
Error Difference (Data Driven minus NWS)     1.4984   3.7751  -12.9181  -0.5721   1.2583   3.4627    14.67
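The summary statistics reported here (Avg, Std, Min, Q1, Med, Q3, Max of the absolute forecast error) can be computed along the following lines. This is a sketch under assumptions: the function name `abs_error_summary` and the toy temperatures are hypothetical, and the poster does not say which quartile convention or which standard deviation (sample or population) was used, so the choices below may not reproduce its figures exactly.

```python
from statistics import mean, quantiles, stdev


def abs_error_summary(forecast, observed):
    """Summarize the absolute error between paired forecasts and observations."""
    errors = [abs(f - o) for f, o in zip(forecast, observed)]
    # Assumption: quartiles by linear interpolation over the sample itself.
    q1, med, q3 = quantiles(errors, n=4, method="inclusive")
    return {
        "Avg": mean(errors),
        "Std": stdev(errors),  # assumption: sample standard deviation
        "Min": min(errors),
        "Q1": q1,
        "Med": med,
        "Q3": q3,
        "Max": max(errors),
    }


# Toy values, not taken from the poster.
forecast = [70, 72, 68, 75, 71]
observed = [71, 70, 68, 79, 71]
summary = abs_error_summary(forecast, observed)
```

Applying the same function to the day-by-day difference between two forecasts' absolute errors, without the `abs`, would give a row analogous to the Error Difference line.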

Distribution of Forecast Error

Time Series of Forecast with Observed

Time Series of Forecast Comparison

Minimum Temperature Forecast

Summary of Absolute Forecast Error

Measure                                         Avg      Std       Min       Q1      Med       Q3      Max
NWS Forecast Abs. Error                      4.7273   3.4719    0.0000   2.0000   4.0000   6.5000  17.0000
Data Driven Abs. Error                       3.1216   2.4693    0.0152   1.2914   2.5373   4.3453  14.7991
Error Difference (Data Driven minus NWS)    -1.6057   3.5674  -13.7842  -3.6986  -1.2547   0.6725  10.7991

Distribution of Forecast Error

Time Series of Forecast with Observed

Time Series of Forecast Comparison

Precipitation Forecast

50% Confusion Matrix for Data Driven Forecast

Weather    Predicted Dry    Predicted Wet
Dry                0.649            0.069
Wet                0.160            0.122

50% Confusion Matrix for NWS Forecast

Weather    Predicted Dry    Predicted Wet
Dry                0.687            0.031
Wet                0.100            0.182
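A 50% confusion matrix of this kind can be built as in the sketch below, assuming each forecast is a probability of precipitation and a day is predicted wet when that probability is at least 0.5. Whether the poster uses "at least" or "greater than" at the threshold is not stated; the function name `confusion_proportions` and the toy inputs are hypothetical.

```python
def confusion_proportions(probabilities, observed_wet, threshold=0.5):
    """Return proportions keyed by (observed, predicted) for a wet/dry forecast.

    A day is predicted "Wet" when its forecast probability reaches the
    threshold; proportions are taken over all forecast days, so they sum to 1.
    """
    counts = {
        ("Dry", "Dry"): 0,
        ("Dry", "Wet"): 0,
        ("Wet", "Dry"): 0,
        ("Wet", "Wet"): 0,
    }
    for p, wet in zip(probabilities, observed_wet):
        observed = "Wet" if wet else "Dry"
        predicted = "Wet" if p >= threshold else "Dry"
        counts[(observed, predicted)] += 1
    total = len(probabilities)
    return {cell: count / total for cell, count in counts.items()}


# Toy inputs, not taken from the poster.
probabilities = [0.1, 0.7, 0.4, 0.9]
observed_wet = [False, False, True, True]
matrix = confusion_proportions(probabilities, observed_wet)
hit_rate = matrix[("Dry", "Dry")] + matrix[("Wet", "Wet")]
```

Summing the diagonal gives the share of days classified correctly. For the two matrices reported here that is 0.649 + 0.122 = 0.771 for the Data Driven forecast and 0.687 + 0.182 = 0.869 for the NWS forecast.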

Comparison of Forecast

[Figure panels: NWS Forecast, DD Forecast, Comparison, Minimum Temperature, Precipitation; each shown One, Two, Three, and Four Days Out]